03:00
Definition: A random vector \(\mathbf{Y}\) is a collection of random variables arranged in a column vector
\[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\]
Linearity property:
\[E[\mathbf{A}\mathbf{Y} + \mathbf{b}] = \mathbf{A}E[\mathbf{Y}] + \mathbf{b}\]
Why this works: Expectation distributes over linear combinations
Key assumption: \(\mathbf{A}\) is constant, not random
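A quick numerical sanity check of the linearity property — a minimal sketch where the matrix \(\mathbf{A}\), vector \(\mathbf{b}\), and distribution of \(\mathbf{Y}\) are arbitrary illustration choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices: a constant 2x3 matrix A, a constant vector b,
# and a 3-dimensional random vector Y with known mean mu.
A = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 3.0]])
b = np.array([1.0, -2.0])
mu = np.array([1.0, 2.0, 3.0])                        # E[Y]
Y = rng.normal(loc=mu, scale=1.0, size=(200_000, 3))  # draws of Y, one per row

mc = (Y @ A.T + b).mean(axis=0)   # Monte Carlo estimate of E[AY + b]
exact = A @ mu + b                # A E[Y] + b, from the linearity property
print(mc, exact)                  # mc is close to exact = [7.5, 9.0]
```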
Definition:
\[\text{Var}(\mathbf{Y}) = E[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T]\]
This creates an \(n \times n\) matrix
\[\text{Var}(\mathbf{Y}) = \begin{bmatrix} \text{Var}(Y_1) & \text{Cov}(Y_1, Y_2) & \cdots \\ \text{Cov}(Y_2, Y_1) & \text{Var}(Y_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\]
Diagonal: individual variances
Off-diagonal: covariances between pairs
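To see this structure concretely, here is a small simulation sketch (the 2-dimensional normal distribution and its covariance values are arbitrary illustration choices): the sample variance–covariance matrix has variances on the diagonal and covariances off the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative true covariance matrix for a 2-dimensional Y
Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
Y = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=500_000)

S = np.cov(Y, rowvar=False)   # sample variance-covariance matrix
print(S)
# Diagonal entries approximate Var(Y1) = 4 and Var(Y2) = 9;
# off-diagonal entries approximate Cov(Y1, Y2) = Cov(Y2, Y1) = 1.
```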
Key transformation rule:
\[\text{Var}(\mathbf{A}\mathbf{Y} + \mathbf{b}) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]
Note: The constant vector \(\mathbf{b}\) doesn’t affect the variance, but the constant matrix \(\mathbf{A}\) does.
Think of it as: \((\mathbf{A} \times \text{variability} \times \mathbf{A}^T)\)
Matrix \(\mathbf{A}\) transforms the variables
Variance gets “stretched” by \(\mathbf{A}\) on both sides
Given: \(\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}\) with \(\text{Var}(\mathbf{Y}) = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}\)
Find: \(\text{Var}(2Y_1 + 3Y_2)\)
Express as: \(\mathbf{A}\mathbf{Y}\) where \(\mathbf{A} = [2, 3]\)
Apply formula:
\[\text{Var}(2Y_1 + 3Y_2) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]
\[= [2, 3] \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix}\]
\[= [2, 3] \begin{bmatrix} 11 \\ 29 \end{bmatrix} = 109\]
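The same computation in NumPy, as a quick check of the worked example:

```python
import numpy as np

V = np.array([[4.0, 1.0],
              [1.0, 9.0]])   # Var(Y) from the example
A = np.array([[2.0, 3.0]])   # row vector picking out 2*Y1 + 3*Y2

print(A @ V @ A.T)           # [[109.]] = 2^2*4 + 3^2*9 + 2*(2*3*1)
```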
The model:
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
We estimate \(\boldsymbol{\beta}\) by “ordinary least squares” (OLS)
Among all linear, unbiased estimators of \(\boldsymbol{\beta}\)…
Which one has the smallest variance?
Assumption 1: Linearity
The model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)
Assumption 2: Zero mean errors
\(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)
Assumption 3: Constant variance & uncorrelated errors
\(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)
Assumption 4: Full rank
\(\mathbf{X}\) has full column rank (no perfect multicollinearity)
Homoscedasticity: All errors have same variance \(\sigma^2\)
\(\text{Var}(\varepsilon_i) = \sigma^2 \text{ for all } i\)
Uncorrelatedness: Errors are uncorrelated (a weaker requirement than independence)
\(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \neq j\)
Linear estimator: \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\)
where \(\mathbf{C}\) doesn’t depend on \(\mathbf{y}\)
Unbiased: \(E[\tilde{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
OLS estimator:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
OLS is one such linear estimator: each coefficient is a weighted combination of the data \(\mathbf{y}\), with \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\)
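A minimal sketch of computing the OLS estimator on simulated data (the design matrix, true coefficients, and noise level are all made-up illustration values; solving the normal equations avoids forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (illustrative values): intercept plus two regressors
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])              # assumed true coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)   # y = X beta + eps

# OLS via the normal equations: solve (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```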
Theorem: Under GM assumptions, OLS is BLUE
BLUE = Best Linear Unbiased Estimator
“Best” = smallest variance
Step 1: Show OLS is unbiased
Step 2: Find variance of OLS
Step 3: Show any other linear unbiased estimator has larger variance
We want to show: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
Start with the OLS formula:
\[E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}]\]
Replace \(\mathbf{y}\) with \(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\):
\[E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})]\]
Multiply through: \[= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]
Note that: \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{I}\)
So we get: \[= E[\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]
Expectation of a sum = sum of expectations: \[= \boldsymbol{\beta} + E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]
In classical regression: \(\boldsymbol{\beta}\) is a fixed but unknown parameter
Randomness comes from: \(\boldsymbol{\varepsilon}\) (and therefore \(\mathbf{y}\)), not from \(\boldsymbol{\beta}\)
This is why: we can pull \(\boldsymbol{\beta}\) out of expectations like a constant
Since \(\mathbf{X}\) is fixed (not random): \[= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\varepsilon}]\]
From Assumption 2: \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)
Therefore: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{0} = \boldsymbol{\beta}\)
We have shown: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
OLS is unbiased under GM assumptions
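One way to see unbiasedness empirically is a small Monte Carlo sketch: hold \(\mathbf{X}\) and \(\boldsymbol{\beta}\) fixed, redraw the errors many times, and average the resulting OLS estimates (all numerical values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design and fixed "true" beta (both illustrative assumptions)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -1.0])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T, fixed across replications

estimates = []
for _ in range(5_000):
    eps = rng.normal(scale=1.5, size=n)      # fresh errors with E[eps] = 0 each time
    estimates.append(XtX_inv_Xt @ (X @ beta + eps))

print(np.mean(estimates, axis=0))            # averages to approximately [2, -1]
```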
Calculate \(\text{Var}(\hat{\boldsymbol{\beta}})\) where:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
08:00
From our unbiasedness proof, we found: \[\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\]
Since \(\boldsymbol{\beta}\) is constant (has zero variance): \[\text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var}((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon})\]
Let \(\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), then: \[\text{Var}(\mathbf{A}\boldsymbol{\varepsilon}) = \mathbf{A}\text{Var}(\boldsymbol{\varepsilon})\mathbf{A}^T\]
From Assumption 3: \(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)
Therefore: \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \cdot \sigma^2\mathbf{I} \cdot \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\]
Pull out the scalar \(\sigma^2\): \[= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\]
Final answer: \[\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]
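The step \(\mathbf{A}\,\sigma^2\mathbf{I}\,\mathbf{A}^T = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) with \(\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) can be checked numerically on any full-rank design — a sketch with an arbitrary simulated \(\mathbf{X}\) and an assumed \(\sigma^2 = 2\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary full-rank design and an assumed sigma^2 (illustration only)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
sigma2 = 2.0

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                        # A = (X^T X)^{-1} X^T

lhs = A @ (sigma2 * np.eye(n)) @ A.T     # A * Var(eps) * A^T
rhs = sigma2 * XtX_inv                   # sigma^2 (X^T X)^{-1}
print(np.allclose(lhs, rhs))             # True
```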
Goal: Show any other linear unbiased estimator has larger variance
Strategy: Consider any linear unbiased estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\)
For \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) to be unbiased: \[E[\mathbf{C}\mathbf{y}] = E[\mathbf{C}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})] = \mathbf{C}\mathbf{X}\boldsymbol{\beta}\] (using Assumption 2, \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\))
This must equal \(\boldsymbol{\beta}\) for any value of \(\boldsymbol{\beta}\)
Therefore we need: \(\mathbf{C}\mathbf{X} = \mathbf{I}\)
Write any unbiased \(\mathbf{C}\) as: \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\)
where \(\mathbf{D}\) is some matrix
Intuition: Any estimator = OLS + some deviation
Key insight: The deviation \(\mathbf{D}\) can only add variance, never reduce it
Mathematical power: Separates what we know (OLS) from the unknown part
Check the constraint \(\mathbf{C}\mathbf{X} = \mathbf{I}\): \([(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]\mathbf{X} = \mathbf{I} + \mathbf{D}\mathbf{X}\)
For the constraint to hold, we need: \[\mathbf{D}\mathbf{X} = \mathbf{0}\]
This is the key restriction on \(\mathbf{D}\)
Start with: \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \text{Var}(\mathbf{C}\mathbf{y}) = \mathbf{C}\text{Var}(\mathbf{y})\mathbf{C}^T\]
Since \(\mathbf{X}\boldsymbol{\beta}\) is constant, \(\text{Var}(\mathbf{y}) = \text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\), so: \[= \sigma^2\mathbf{C}\mathbf{C}^T\]
Replace \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\): \[\mathbf{C}\mathbf{C}^T = [(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}][(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]^T\]
This gives us four terms: \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T\] \[+ \mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]
Since \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (and hence \(\mathbf{X}^T\mathbf{D}^T = (\mathbf{D}\mathbf{X})^T = \mathbf{0}\)), both cross terms vanish
After cancellation: \[\mathbf{C}\mathbf{C}^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]
Which simplifies to: \[= (\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]
Therefore: \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T]\]
\(\mathbf{D}\mathbf{D}^T\) is positive semi-definite: for any vector \(\mathbf{a}\), \(\mathbf{a}^T\mathbf{D}\mathbf{D}^T\mathbf{a} = \|\mathbf{D}^T\mathbf{a}\|^2 \geq 0\)
This means: \(\mathbf{D}\mathbf{D}^T \geq \mathbf{0}\) (in the matrix sense)
Since \(\mathbf{D}\mathbf{D}^T \geq \mathbf{0}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} = \text{Var}(\hat{\boldsymbol{\beta}})\]
We have shown: OLS is unbiased, its variance is \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\), and any other linear unbiased estimator has variance at least as large
Therefore OLS is BLUE: it has minimum variance among all linear unbiased estimators
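To illustrate the conclusion, one can construct an alternative linear unbiased estimator by picking any \(\mathbf{D}\) with \(\mathbf{D}\mathbf{X} = \mathbf{0}\) (here, by projecting an arbitrary matrix onto the orthogonal complement of the column space of \(\mathbf{X}\)) and comparing its variance to OLS — a sketch under the same kind of assumed design and \(\sigma^2\) as above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Arbitrary full-rank design and an assumed sigma^2 (illustration only)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma2 = 2.0

XtX_inv = np.linalg.inv(X.T @ X)
C_ols = XtX_inv @ X.T                           # OLS weights

# Build D with D X = 0 by projecting arbitrary rows onto the orthogonal
# complement of the column space of X (M is the residual-maker matrix)
M = np.eye(n) - X @ XtX_inv @ X.T
D = rng.normal(size=(p, n)) @ M
C = C_ols + D                                   # another linear unbiased estimator (C X = I)

var_ols   = sigma2 * C_ols @ C_ols.T            # = sigma^2 (X^T X)^{-1}
var_other = sigma2 * C @ C.T

diff = var_other - var_ols                      # = sigma^2 D D^T
print(np.linalg.eigvalsh(diff).min() >= -1e-10) # True: difference is positive semi-definite
print(np.diag(var_other) >= np.diag(var_ols))   # True for every coefficient
```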